Generalized neural closure models with interpretability

Improving the predictive capability and reducing the computational cost of dynamical models is often at the heart of augmenting computational physics with machine learning (ML). However, most learning results are limited in interpretability and generalization over different computational grid resolutions, initial and boundary conditions, domain geometries, and physical or problem-specific parameters. In the present study, we simultaneously address all these challenges by developing the novel and versatile methodology of unified neural partial delay differential equations. We augment existing/low-fidelity dynamical models directly in their partial differential equation (PDE) forms with both Markovian and non-Markovian neural network (NN) closure parameterizations. The melding of the existing models with NNs in the continuous spatiotemporal space followed by numerical discretization automatically allows for the desired generalizability. The Markovian term is designed to enable extraction of its analytical form and thus provides interpretability. The non-Markovian terms account for the time delays that are inherently missing from such models but are needed to represent the real world. Our flexible modeling framework provides full autonomy for the design of the unknown closure terms, such as using any linear-, shallow-, or deep-NN architectures, selecting the span of the input function libraries, and using either or both Markovian and non-Markovian closure terms, all in accord with prior knowledge. We obtain adjoint PDEs in the continuous form, thus enabling direct implementation across differentiable and non-differentiable computational physics codes, different ML frameworks, and treatment of nonuniformly-spaced spatiotemporal training data. We demonstrate the new generalized neural closure models (gnCMs) framework using four sets of experiments based on advecting nonlinear waves, shocks, and ocean acidification models. Our learned gnCMs discover missing physics, find leading numerical error terms, discriminate among candidate functional forms in an interpretable fashion, achieve generalization, and compensate for the lack of complexity in simpler models. Finally, we analyze the computational advantages of our new framework.


SI-1 Adjoint Equations for Neural Partial Delay Differential Equations
In this section, we provide a detailed derivation of the adjoint equations for neural partial delay differential equations (nPDDEs). The derivation is inspired by the adjoint-equation derivations for general PDEs by Li and Petzold, 2004 [1] and Cao et al., 2002 [2]. Our nPDDE is of the form given in equation (SI-1): the evolution of the state u(x, t) is governed by the known low-fidelity model terms augmented with a Markovian closure ℱ(•; φ) and a non-Markovian (delay) closure 𝒟(•; θ), where ℱ and 𝒟 are two neural networks (NNs) parameterized by the weights φ and θ, respectively. As compared to PDEs, PDDEs require the specification of a history function h(x, t) for the initial conditions. We consider the presence of an arbitrary number of spatial derivatives, with the highest order denoted by p ∈ Z⁺. We can rewrite equation (SI-1) as an equivalent system of coupled PDDEs with discrete delays (equation SI-2), holding for x ∈ Ω and t ≥ 0. Let us assume that high-fidelity data are available at M discrete times, t_1 < ... < t_M ≤ T, and at N(t_m) spatial locations (x_i ∈ Ω, ∀ i ∈ {1, ..., N(t_m)}) for each of these times. We define the scalar loss function ℒ as a sum of scalar loss functions ℒ_m(•), such as the mean-squared error (MSE), evaluated at the data times and locations, with the Kronecker delta function δ(•) used to pick out the data times. In order to derive the adjoint PDEs, we start with the Lagrangian corresponding to the above system (equation SI-3), where λ(x, t), μ(x, t), and γ(t) are the Lagrangian variables. We then take the derivative of the Lagrangian (equation SI-3) w.r.t. φ (for brevity, we use subscript notation for partial derivatives). Integrating by parts and requiring the terms that multiply the state perturbations to vanish yields the adjoint PDEs (equations SI-13 and SI-14), while the remaining terms provide the gradient of the loss w.r.t. φ (equation SI-15).

If we restart the above derivation by taking the derivative of the Lagrangian (equation SI-3) w.r.t. θ instead, we arrive at the same adjoint PDEs (equations SI-13 & SI-14), together with the corresponding gradient of the loss w.r.t. θ. Finally, using any stochastic gradient descent algorithm, we can find the optimal values of the weights φ and θ.
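To make the structure of this argument concrete, the following is a minimal sketch of the same Lagrangian-adjoint pattern for a scalar ODE with a single Markovian NN closure and a terminal-time loss. It is only an illustrative analogue under these simplifying assumptions (no delays, no spatial derivatives, an assumed terminal loss g, and one particular sign convention); it is not the nPDDE adjoint of equations SI-13 to SI-15. The symbols u, F, ℱ, φ, and λ mirror the notation above.

\begin{aligned}
&\text{Forward model:} \quad && u_t = F(u) + \mathcal{F}(u;\phi), \qquad u(0)=u_0, \qquad \mathcal{L} = g(u(T)), \\
&\text{Lagrangian:} \quad && \mathcal{I} = g(u(T)) + \int_0^T \lambda(t)\,\big[u_t - F(u) - \mathcal{F}(u;\phi)\big]\,\mathrm{d}t, \\
&\text{Adjoint equation:} \quad && \lambda_t = -\lambda\,\big[F_u(u) + \mathcal{F}_u(u;\phi)\big], \qquad \lambda(T) = -g_u(u(T)), \\
&\text{Gradient:} \quad && \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\phi} = -\int_0^T \lambda(t)\,\mathcal{F}_\phi(u;\phi)\,\mathrm{d}t .
\end{aligned}

The adjoint equation is integrated backward from t = T. In the full nPDDE setting, the spatial derivatives contribute additional terms and boundary conditions through repeated integration by parts, and the delays introduce time-advanced arguments in the adjoint equations.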

SI-2.1 Architectures
The architectures used to generate the results corresponding to the different experiments are provided in table SI-1.

SI-2.2 Hyperparameters
The values of the tuned training hyperparameters corresponding to the different experiments are listed below. In all the experiments, the number of iterations per epoch is computed by dividing the number of time-steps in the training period by the product of the batch size and the length of the short time-sequences, adding 1, and rounding up to the next integer.
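For concreteness, a minimal Python sketch of this counting rule follows; the function name and the example value of 300 training time-steps are hypothetical (the actual training-period lengths are not reported here), chosen only so that, under this reading of the rule, the result matches the 26 iterations per epoch quoted for Experiments-2a.

import math

def iterations_per_epoch(n_train_steps, batch_size, batch_time):
    # One reading of the rule above: divide the number of time-steps in the
    # training period by (batch_size * batch_time), add 1, and round up.
    return math.ceil(n_train_steps / (batch_size * batch_time) + 1)

# Hypothetical example: 300 training time-steps, batches of 4 short
# sequences, each spanning 3 time-steps.
print(iterations_per_epoch(300, 4, 3))  # -> 26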

SI-2.2.1 Experiments-1a:
For training, we randomly select short time-sequences spanning 3 time-steps (batch-time) and extract data at every time-step to form batches of size 16; 4 iterations per epoch are used; an exponentially decaying learning rate (LR) schedule is used with an initial LR of 0.075, a decay rate of 0.97, and 4 decay steps; the RMSprop optimizer is employed; training is run for a total of 150 epochs. ℒ1 and ℒ2 regularization with factors of 1.5 × 10⁻³ and 1 × 10⁻⁵, respectively, is used; the weights are pruned if their value drops below 5 × 10⁻³.
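A minimal sketch of this learning-rate schedule and optimizer is given below; the use of TensorFlow/Keras and the continuous (non-staircase) decay are assumptions made for illustration, not a statement about the original implementation.

import tensorflow as tf

# Experiments-1a settings: LR starts at 0.075 and decays at a rate of 0.97
# per 4 optimizer steps (continuous decay, since staircase defaults to False).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.075,
    decay_steps=4,
    decay_rate=0.97)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule)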

SI-2.2.2 Experiments-1b:
Only Markovian closure case: We randomly select short time-sequences spanning 30 time-steps (batch-time) and extract data at every other time-step to form batches of size 2. In total, 24 iterations per epoch are used, with every 8 of them belonging to one of the training parameter pairs. An exponentially decaying learning rate (LR) schedule with an initial LR of 0.025, a decay rate of 0.95, and 24 decay steps is used; the RMSprop optimizer is employed; we train for a total of 30 epochs. We also use both ℒ1 and ℒ2 regularization for the weights of the neural network, with factors of 5 × 10⁻⁴ and 5 × 10⁻⁴, respectively, along with pruning of the weights if their value drops below 5 × 10⁻³.

Both Markovian and non-Markovian closures case: A batch-time of 30 time-steps is used with data extracted at every other time-step to form batches of size 2; 32 iterations per epoch are used, with every 8 of them belonging to one of the training parameter pairs; an exponentially decaying learning rate (LR) schedule is employed with an initial LR of 0.01, a decay rate of 0.95, and 32 decay steps; the RMSprop optimizer is used; we train for a total of 30 epochs. We also use both ℒ1 and ℒ2 regularization for the weights of the neural network, with factors of 1.5 × 10⁻³ and 1 × 10⁻⁵, respectively, along with pruning of the weights if their value drops below 5 × 10⁻³.
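A minimal NumPy sketch of the short time-sequence batching used here (and, with different settings, in the other experiments) follows; the array layout, function name, and the way starting indices are drawn are illustrative assumptions, not the authors' implementation.

import numpy as np

def sample_batch(u, batch_time=30, stride=2, batch_size=2, seed=None):
    # u: training trajectory of shape (n_time_steps, n_grid_points).
    # Each batch member is a short time-sequence spanning `batch_time`
    # time-steps, with data extracted at every other step (stride=2).
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, u.shape[0] - batch_time, size=batch_size)
    return np.stack([u[s : s + batch_time : stride] for s in starts])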

SI-2.2.3 Experiments-2a:
Two model parameters are set to −100 and 1000 (see the main text for their definitions and units). For training, we randomly select short time-sequences spanning 3 time-steps (batch-time) and extract data at every other time-step to form batches of size 4; we use 26 iterations per epoch; an exponentially decaying learning rate (LR) schedule is used with an initial LR of 0.075, a decay rate of 0.97, and 26 decay steps; the RMSprop optimizer is adopted; training is terminated at 200 epochs. We also use both ℒ1 and ℒ2 regularization for the weights of the neural network, with factors of 1.5 × 10⁻³ and 1 × 10⁻³, respectively, along with pruning of the weights if their value drops below 5 × 10⁻³.
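The weight regularization and pruning used here can be sketched as follows; the use of TensorFlow variables and the exact point in the training loop at which pruning is applied are assumptions made for illustration.

import tensorflow as tf

L1_FACTOR, L2_FACTOR, PRUNE_TOL = 1.5e-3, 1e-3, 5e-3  # Experiments-2a values

def regularization_penalty(weights):
    # L1 + L2 penalty on the NN weights, added to the data-misfit loss;
    # `weights` is assumed to be a list of tf.Variable objects.
    return sum(L1_FACTOR * tf.reduce_sum(tf.abs(w))
               + L2_FACTOR * tf.reduce_sum(tf.square(w)) for w in weights)

def prune_weights(weights):
    # Zero out any weight whose magnitude has dropped below the tolerance.
    for w in weights:
        w.assign(tf.where(tf.abs(w) < PRUNE_TOL, tf.zeros_like(w), w))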

SI-2.2.4 Experiments-2b:
We use a batch-time of 3 time-steps with data extracted at every other time-step to form batches of size 8; we use 26 iterations per epoch; an exponentially decaying learning rate (LR) schedule is applied with an initial LR of 0.075, a decay rate of 0.97, and 26 decay steps; the RMSprop optimizer is employed; training is terminated at 200 epochs. We also use both ℒ1 and ℒ2 regularization for the weights of the Markovian closure, with factors of 1.5 × 10⁻³ and 1 × 10⁻³, respectively, along with pruning of the weights if their value drops below 5 × 10⁻³. For the neural network in the non-Markovian closure term, only ℒ2 regularization with a factor of 1 × 10⁻⁵ is used.
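A sketch of how the two closures could receive the different penalties described above is given below; the variable-group names are hypothetical and only the penalty factors are taken from the text.

import tensorflow as tf

def closure_regularization(markovian_weights, non_markovian_weights):
    # Experiments-2b: L1 + L2 penalty on the Markovian-closure weights,
    # L2-only penalty on the non-Markovian NN weights.
    reg_markovian = sum(1.5e-3 * tf.reduce_sum(tf.abs(w))
                        + 1e-3 * tf.reduce_sum(tf.square(w))
                        for w in markovian_weights)
    reg_non_markovian = sum(1e-5 * tf.reduce_sum(tf.square(w))
                            for w in non_markovian_weights)
    return reg_markovian + reg_non_markovian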
Finally, for all the experiments and their multiple repeats with the exact same tuned hyperparameters, we show the variation of the training and validation errors as training progresses (figure SI-1).

[From the model-parameter descriptions: the carbon-to-nitrogen ratios for phytoplankton, zooplankton, and detritus, and the seawater density.]

Figure SI-1: Variation of the training (left column) and validation (right column) loss with epochs, for each of the experiments 1a, 1b, 2a, and 2b. We use boxplots to provide statistical summaries of the multiple training repeats done for each set of experiments. Each box and its whiskers provide a five-number summary: minimum, first quartile (Q1), median (orange solid line), third quartile (Q3), and maximum, along with outliers (black circles) if any. These results accompany the architectures detailed in table SI-1.